As we have \(31\)-dimensional data, we will perform a preliminary analysis of the variables before running classification. The following is a loose structure of our method for tackling this problem:
We have a few types of data to consider:

- The variables KaggleSet and KaggleWeight can be ignored: they denote which data points were in the provided Kaggle challenge and their relative weights, which is irrelevant to us.
- The discrete classification variable Label, which takes values in \(\{b,s\}\).
- The continuous variable Weight, which will be used to compute the AMS and will not be used in classification.
- The discrete variable PRI_jet_num, which denotes the number of jets in each event and takes values in \(\{0,1,2,3\}\).
- All other variables are direct or indirect measurements and are continuous.
Since we are classifying, we want to remove any redundancies in our data. Fortunately, we have the following co-dependencies:

- PRI_jet_all_pt = PRI_jet_leading_pt + PRI_jet_subleading_pt
- PRI_lep_pt = DER_pt_ratio_lep_tau * PRI_tau_pt
- Jet-related variables depend on PRI_jet_num (see the next section for more).
We can either remove PRI_jet_all_pt or remove both PRI_jet_leading_pt and PRI_jet_subleading_pt, reducing the dimension by \(1\) or \(2\) respectively. The same idea can be applied to the other (multiplicative) dependence. With these reductions we could remove up to \(4\) dimensions from our data.
If we wish to be more liberal with our removal, we could throw out all jet-related variables besides PRI_jet_num, removing \(10\) more dimensions from our data. However, this assumes that the other jet variables have little impact on classification, which has not yet been validated.
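As a minimal illustration of removing the algebraically redundant columns (a pandas sketch with made-up values; the column names are those of the dataset, and the identities are the ones stated above):

```python
import pandas as pd

# Toy frame mimicking the co-dependent columns (values are made up).
df = pd.DataFrame({
    "PRI_jet_leading_pt":    [50.0, 30.0],
    "PRI_jet_subleading_pt": [20.0, 10.0],
    "PRI_tau_pt":            [25.0, 40.0],
    "DER_pt_ratio_lep_tau":  [2.0, 0.5],
})
# Construct the dependent columns via the stated identities.
df["PRI_jet_all_pt"] = df["PRI_jet_leading_pt"] + df["PRI_jet_subleading_pt"]
df["PRI_lep_pt"] = df["DER_pt_ratio_lep_tau"] * df["PRI_tau_pt"]

# Since PRI_jet_all_pt and PRI_lep_pt are exact functions of other columns,
# dropping them loses no information.
reduced = df.drop(columns=["PRI_jet_all_pt", "PRI_lep_pt"])
print(list(reduced.columns))
```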
Below are the variables which contain undefined values. By convention, these were provided to us with the value -999, which is out of range for every observation.
Notice that every column besides DER_mass_MMC is a jet variable, and among those there are only two distinct values of completion_rate. In fact, these correspond to different values of PRI_jet_num:
- 0 jets: all jet variables are missing data.
- 1 jet: only PRI_jet_leading_pt, PRI_jet_leading_eta and PRI_jet_leading_phi have data.
- 2 or more jets: all jet variables have data.
The above result can be found in the variable handbook provided with the Kaggle challenge. Unfortunately, DER_mass_MMC has no such explanation; its missing values may result from some event during measurement. It is nonetheless assigned the same -999 value as the other missing data.
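Recoding the -999 sentinel as a true missing value keeps it from skewing summaries and plots. A minimal pandas sketch with toy rows (the replacement value and column choices are ours for illustration):

```python
import numpy as np
import pandas as pd

# Toy rows: PRI_jet_num and one jet variable, with -999 marking "undefined".
df = pd.DataFrame({
    "PRI_jet_num": [0, 1, 2],
    "PRI_jet_subleading_pt": [-999.0, -999.0, 35.0],
})

# Recode the sentinel as NaN so statistics ignore the undefined entries.
df_na = df.replace(-999.0, np.nan)
print(df_na["PRI_jet_subleading_pt"].isna().sum())  # 2 of 3 rows missing
```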
Despite understanding the cause of (most of) our missing data, we still do not know its impact. The following section is dedicated to understanding the correlation between each variable's missing data and the classification. Take DER_mass_MMC for example: if the missing data is distributed uniformly across \(y_i \in \{b,s\}\), then the appearance of NA has no impact on classification; if it is disproportionately skewed towards \(b\), however, we can use this to assist our classification.
Below we compute the percentage of NA data in Background and Signal respectively. If this missing data is distributed uniformly across Background and Signal then the percentages should be very close.
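The per-class missingness computation can be sketched as follows (a pandas illustration on a toy sample; the real computation runs over the full dataset):

```python
import numpy as np
import pandas as pd

# Toy sample: Label in {b, s}, one variable using the -999 sentinel.
df = pd.DataFrame({
    "Label": ["b", "b", "b", "b", "s", "s", "s"],
    "DER_mass_MMC": [-999.0, -999.0, 120.0, 110.0, -999.0, 125.0, 130.0],
})

# Percentage of missing entries within each class.
na_rate = (
    df.replace(-999.0, np.nan)
      .groupby("Label")["DER_mass_MMC"]
      .apply(lambda v: 100 * v.isna().mean())
)
print(na_rate)
```

If the two percentages are close, missingness carries no class information; a large gap suggests it does.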
Notice that Signal consistently has \(13-18\%\) less missing data than Background. This means that in both the DER_mass_MMC and jet-variable cases, the amount of missing data is indicative of the classification. Consequently, encoding missingness as -999 (or some other extreme value) can help with classification, and these entries will not be removed.
To get an understanding of the behavior of each variable given its classification, we split each variable according to whether its output was assigned \(y=b\) or \(y=s\). The entire list can be found in the appendix, but here are a few of note:
Note that in the data for these plots the variables have NA instead of -999 for missing data; again, see the appendix for plots with -999 included. Some key takeaways from these plots are:
Many of our parameters range over \([-\pi,\pi]\), as they correspond to the angles at which detections are made.
We see that some of these angle variables (PRI_tau_phi, PRI_lep_phi, PRI_met_phi, PRI_jet_leading_phi and PRI_jet_subleading_phi) are approximately uniformly distributed, with very little variation between the classifications. Additionally, all except PRI_jet_subleading_phi are unaffected when we include the -999 values.
We have many single-peak distributions, some of which show clear translations in their peaks: DER_mass_MMC, DER_mass_transverse_met_lep, DER_mass_vis, PRI_met_sumet and DER_pt_ratio_lep_tau.
We can see that many jet-related variables are dominated by their -999 values.
Looking at these plots, there are a few contenders for variables important to classification; the distributions with translated peaks are good examples. Additionally, the angle variables which are largely uniform are likely to contribute only noise and not help with classification. Before we decide on our preferred variables, we employ an information-theoretic technique to measure the variation between the two distributions: the mutual information.

## The Mutual Information between \(b\) and \(s\)
Below we compute the mutual information between \(V \mid y=s\) and \(V \mid y= b\), which we denote \(V_s\) and \(V_b\) respectively.
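A common way to estimate this kind of class-dependence is a histogram estimate of the mutual information between a variable and the label; the sketch below takes that route (the binning choices and synthetic data are ours, not the report's):

```python
import numpy as np

def mutual_information(v, y, bins=20):
    """Histogram estimate of I(V; Y) in nats, for continuous v and binary labels y."""
    joint, _, _ = np.histogram2d(v, y, bins=[bins, 2])
    p_vy = joint / joint.sum()                 # joint distribution
    p_v = p_vy.sum(axis=1, keepdims=True)      # marginal of V
    p_y = p_vy.sum(axis=0, keepdims=True)      # marginal of Y
    mask = p_vy > 0
    # MI = sum p(v,y) log( p(v,y) / (p(v) p(y)) ), over the joint vs. product.
    return float((p_vy[mask] * np.log(p_vy[mask] / (p_v @ p_y)[mask])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 5000)
# A variable whose mean shifts with the class carries information about it...
v_shifted = rng.normal(loc=2.0 * y, scale=1.0)
# ...while an independent variable carries (almost) none.
v_noise = rng.normal(size=5000)
mi_shifted = mutual_information(v_shifted, y)
mi_noise = mutual_information(v_noise, y)
print(mi_shifted, mi_noise)
```

Variables with translated peaks behave like `v_shifted` (high MI); the uniform angle variables behave like `v_noise` (MI near zero).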
Below we plot the joint and product distributions for SOME_VARIABLE as a visual aid to how the mutual information is computed.
Below we display the mutual information of all of our variables.
These results justify many of our direct observations from plotting the densities:

- The variables with distinct peaks (DER_mass_MMC, DER_mass_transverse_met_lep, DER_mass_vis, PRI_met_sumet and DER_pt_ratio_lep_tau) had very high mutual information, backing up our intuition for using them in classification.
- Surprisingly, many of the jet variables still had very low mutual information, even lower than the uniform angle variables.
FDA creates an embedding that tells us how linearly separable our data can possibly be, by maximizing the between-class scatter while minimizing the within-class scatter. This is a useful aid for deciding whether a linear classifier is appropriate, and can additionally be used to see how removing variables affects the separability of our data.
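For the two-class case this reduces to the classic Fisher discriminant: the direction \(w = S_w^{-1}(m_1 - m_0)\) maximizes the ratio of between-class to within-class scatter of the projections. A minimal numpy sketch on synthetic data (our own illustration, not the report's code):

```python
import numpy as np

def fda_direction(X, y):
    """Two-class Fisher discriminant direction and the resulting scatter ratio."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter S_w; the optimal direction is w = S_w^{-1}(m1 - m0).
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    # Separability along w: between-class over within-class scatter of projections.
    z0, z1 = X0 @ w, X1 @ w
    between = (z0.mean() - z1.mean()) ** 2
    within = z0.var() + z1.var()
    return w, between / within

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 3)), rng.normal([3, 0, 0], 1, (500, 3))])
y = np.repeat([0, 1], 500)
w, ratio = fda_direction(X, y)
print(ratio)  # larger ratio => more linearly separable
```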
We consider four increasingly aggressive cases of variable elimination: 1) including all variables; 2) removing all 4 algebraic co-dependencies; 3) removing the above and any uniform variables; 4) removing the above and the 10 variables with the lowest mutual information. The following plots show the FDA embedding for each of these cases.
For these different FDA embeddings, we can compare the ratio of between-class scatter to within-class scatter:
We find that removing variables slightly lowers the linear separability of the data, but in general the values are comparable. Based on the plots, we can conclude that it is not possible to achieve a high degree of linear separability between the signal and background events, indicating that a feature transform is most likely appropriate.
We divided our dataset into a training and testing split, training SVMs on the training data and computing the AMS on the testing data.
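The evaluation metric is the challenge's Approximate Median Significance, \(\mathrm{AMS} = \sqrt{2\left((s+b+b_r)\ln\!\left(1+\frac{s}{b+b_r}\right)-s\right)}\) with \(b_r = 10\), where \(s\) and \(b\) are the summed weights of true-positive and false-positive events in the selection region. A direct implementation:

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance from the HiggsML challenge.

    s, b: weighted sums of true-positive (signal) and false-positive
    (background) events selected by the classifier; b_reg = 10 is the
    regularization term fixed by the challenge.
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

print(ams(100.0, 400.0))
```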
By training SVMs with a linear kernel vs. a radial kernel on a 20% training set (using all predictor variables), we found that a radial kernel consistently yielded higher AMS:
Based on these results, along with the results of FDA indicating that the dataset is not linearly separable, we moved forward using a radial kernel. We performed 5-fold cross-validation over the 20% training set to tune the two hyperparameters using a grid search: cost and \(\gamma\) (bandwidth). We allowed cost to take on the values 0.5, 1, and 2 and \(\gamma\) to take on the values 0.01, 0.1, and 0.5.
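The grid search can be sketched as follows (a scikit-learn illustration on synthetic data with the same \(C\) and \(\gamma\) grid; the report's actual tuning is scored against the AMS rather than accuracy):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Toy two-class data standing in for the 20% training split.
X = np.vstack([rng.normal(0.0, 1.0, (200, 4)), rng.normal(1.5, 1.0, (200, 4))])
y = np.repeat([0, 1], 200)

# Radial (RBF) kernel, 5-fold CV over cost C in {0.5, 1, 2}
# and gamma in {0.01, 0.1, 0.5}.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.5, 1.0, 2.0], "gamma": [0.01, 0.1, 0.5]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```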